All the results were filtered by the adj.p-value < 0.05 and logFC >= 0.3
## An object of class Seurat
## 15327 features across 6961 samples within 1 assay
## Active assay: RNA (15327 features, 0 variable features)
## 2 dimensional reductions calculated: umap, pca
For single cell methods I am going to consider both Seurat and Scanpy
For Scanpy I take in account the methods: - t-test_overestim_var - wilcoxon
Unfortunately the logistic regression provided by scanpy does not give LogFoldChanges and will not be taken in account.
The t-test_overestim_var recognized as significative almost all the genes and include almost all the genes from the wilcoxon test. Nolw let’s evaluate the expression of the genes.
Top upregulated genes in Erythrocytes:
Top downregulated genes in Erythrocytes:
Genes with lower abs(logFC)
The result’s in terms of LogFC looks really odd. Let’s see the values for this genes:
## Symbol pvals_adj logfoldchanges abs_lfc scores
## 68 KRT1 0.000000e+00 29.73123 29.73123 48.64429
## 113 LINC00570 3.816327e-227 28.53994 28.53994 35.04078
## 131 ARG1 2.312735e-192 28.81118 28.81118 31.80785
## 13375 ARPP21 1.717660e-226 -29.04441 29.04441 -34.57604
## 13480 CD72 1.808806e-242 -28.93440 28.93440 -35.96811
## 13887 MME 0.000000e+00 -29.52419 29.52419 -44.01670
## Aberrant.erythroid Pro.B.cells
## 68 0.4846416 0.0000000
## 113 0.3146137 0.0000000
## 131 0.2832765 0.0000000
## 13375 0.0000000 0.2891921
## 13480 0.0000000 0.3130016
## 13887 0.0000000 0.4122525
Apparently Scanpy has a bias evaluating the logFC of
genes that are completely missing in one of the conditions.
But The genes are correctly plotted in the report. So I will check the expression by
the z-score instead of the LogFC:
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
Better result for the extremes logFC (scores), bt still not clear results for the low scored genes. Indeed those genes are expressed by less than 1% of cells:
## Symbol pvals_adj logfoldchanges abs_lfc scores Aberrant.erythroid
## 816 MIR133A1HG 0.04988015 -2.444078 2.444078 -1.980822 0.0006205399
## 817 UACA 0.04972146 -1.606474 1.606474 -1.982030 0.0009308098
## 818 CEP170B 0.04952603 -2.087353 2.087353 -1.983738 0.0003102699
## Pro.B.cells
## 816 0.002140182
## 817 0.005082932
## 818 0.003477796
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Some up terms can be linked to aberrant erythroid cells, down terms slcan be linked to differentiation/prolferation properties od pro-B cells
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Similar result with wilcoxon, slightly better GO terms
For Seurat I take in account the methods: - t-test - wilcoxon - logistic regression
Similar numbers of DE genes were found by Seurat’s different flavours.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Better result with seurat, the top upregulated genes are strictly correlated with erythrocytes. Only CD74 is downregulated and correlated with immune cells. The low abs(LFC) genes still are not well evaluable. In average the different gene expression is better appreciable
GO terms are also better compared to scanpy
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Similar to the previous
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
As above.
Now let’s see the genes fuond uniquely by logreg/wx and t-test:
Unique in wx/logeg:
## # A tibble: 1 × 8
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster Symbol abs_lfc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1.55e-11 0.319 0.289 0.407 0.000000237 Aberrant erythroid NEAT1 0.319
NEAT1 result upregulated in Aberrant erythroid cells only with wilcoxon/logreg method. The logFC are quite small (0.32) and the distribution is similar between the cell lines. Looking at the expression divided by the batch, no batch specific differential expression is available.
Unique in t-test:
## # A tibble: 5 × 8
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster Symbol abs_lfc
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1.11e-15 0.469 0.237 0.216 1.70e-11 Aberrant erythroid RNF19A 0.469
## 2 2.33e- 7 0.432 0.24 0.284 3.58e- 3 Aberrant erythroid MGST3 0.432
## 3 3.87e-12 0.311 0.207 0.175 5.93e- 8 Aberrant erythroid ZNF451 0.311
## 4 2.10e- 6 -0.313 0.398 0.391 3.22e- 2 Aberrant erythroid ODC1 0.313
## 5 2.39e- 7 -0.520 0.483 0.457 3.67e- 3 Aberrant erythroid REXO2 0.520
All the genes doesn’t have a clear difference in the expression among the batches, I will drop the seurat t-test and keep the logreg/wx
In the pseudobulks, counts were summed for each celltype and sample, in practice obtaining a column of counts for each celltype and sample. Then, genes that weren’t showing counts in more than 90% of the cells were removed. Also, genes showing expression below 10 counts in all the pseudobulks were removed (pseudobulks are the sum of cells counts. showing less than 10 counts indicate almost all the cells didn’t had count’s for that gene).
## [1] 11505 121
For DESeq2, firstly I run it with the “LRT” methods that showed better results here. The PCA plot looks good but the clustering mix some samples here. I run DESeq2 specifing the model to take account of the batches.
Then, I tested combat_seq to correct the batch effect here. Also, I decided to see
if there was a batch effect afrom unknown sources using the single
variable analysis (sva) here.
DESeq2 without batch correction found the least number of DE genes
compared to the methods tested until now. Also, the DEgenes found with
batch corrected methods differs from DESeq2.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Genes with high and low logFC are well divided. Also, the genes with low abs(logDC) show better division between the cell types. BP terms aren’t really specific.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Genes with high and low logFC are well divided. Also, the genes with low abs(logDC) show better division between the cell types. BP terms aren’t really specific.
BP terms aren’t really specific. read this for vescicole and autophagy.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Despite both found ~4k DE genes, only a partial overlap of DE genes between DESeq2 + sva and Seurat. Also, the top lgFC genes have really poor expression.
Normal and with combat
Limma found more than 7k genes, no differences with the batch corrected data
Finally I tested edgeR with LRT as best option from here, I tested combat too. edgeR edgeR_combat
Both the methods found more than 9k genes with some hundreds uniquel per each method.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Extreme LogFC show different results from what obtained with other algorithms, maybe because of the hand-written method. Low abs(logFC) are not appreciable. BP terms looks good
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
High number of DE genes found, the low(logFC) aren’t well distinguishable. GOOD BPs
RRMM_BM_8 has a lower expression of both the subunits… Interestingly is a longitudinal sample and we can look at the expression of the genes along the samples. Also, are those cells misslabelled? or the expression of HBA1/2 is lower in all the cell types. To check it I am loading the original dataset and plot the expression of all the samples from this patient:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6313873 337.2 11137894 594.9 11137894 594.9
## Vcells 11062202 84.4 58216616 444.2 72770761 555.2
Interestingly also the expression of HBA1/2 in other cell lines is lower in RR_BM_8, even looking at the other longitudinal data from the same patient